
Add schema for Airline Reporting Carrier On-Time Performance Dataset #2

Open · wants to merge 1 commit into master

Conversation

@djalova (Collaborator) commented Dec 8, 2020

No description provided.

@xuhdev (Collaborator) commented Dec 15, 2020

Could you resolve the conflicts? It seems like this is now unblocked.

@edwardleardi (Collaborator) commented

This dataset is 81 GB. Should we hold off on adding it until we add code that properly handles datasets that can't be fully loaded into memory?

We probably want to implement download resuming and maybe use a dependency like Dask for loading large Pandas DataFrames.
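
As a rough illustration of the Dask side of this (the paths, column names, and block size below are placeholders for illustration, not actual pydax code or the dataset's real schema):

```python
# Rough sketch, not pydax code: lazily load a large CSV with Dask so the
# full 81 GB never has to fit in memory. Paths and columns are placeholders.
import dask.dataframe as dd

# Dask splits the CSV files into partitions and only reads them on demand.
flights = dd.read_csv("airline_ontime/*.csv", blocksize="256MB", dtype=str)

# Computations stay lazy until .compute() materializes a (much smaller) result.
counts = flights.groupby("Reporting_Airline")["ArrDelay"].count().compute()
print(counts.head())
```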

@xuhdev (Collaborator) commented Dec 17, 2020

@edwardleardi Yes, I agree. How about we make this PR independent of the monolithic issue?

@xuhdev added the dataset label (Add or update one or a few specific datasets) on Dec 17, 2020
@djalova (Collaborator, Author) commented Dec 17, 2020

The problem with this one was that I had trouble running the tests locally. I can try running them again on a machine with more storage.

@edwardleardi (Collaborator) commented Dec 17, 2020

@djalova yeah, the tests in this repo actually download and load every dataset that is part of the dataset schema. I don't think we'll ever get this to pass without properly handling large datasets like this one. The way things stand currently, we would need a machine with 81 GB of memory to load the dataset when running the test. Plus it would need to download the whole dataset, so I don't think we want to do that for every test.
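
Purely as an illustration of how huge datasets could be kept out of the default test run (not something the repo does today; the marker, size constant, and threshold below are made up):

```python
# Illustrative only, not existing pydax test code: skip datasets that are too
# large to download and load in a regular CI run.
import pytest

DATASET_SIZE_GB = 81  # hypothetical size metadata taken from the schema

@pytest.mark.skipif(
    DATASET_SIZE_GB > 10,  # arbitrary "too big for CI" threshold
    reason="dataset too large to download and load in regular test runs",
)
def test_airline_ontime_schema_loads():
    ...  # the usual download + load assertions would go here
```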

@edwardleardi (Collaborator) commented

@xuhdev Yeah, we should definitely make it independent.

How do you think we should approach implementing features for loading large datasets in the future? It would probably be on a loader-by-loader basis, right? Since loading a large dataset depends on the dataset type and the Python object you want to load it into?

What do you think about creating a new epic for loaders? We can add issues for an initial loader implementation, like we did with CSVPandasLoader, and then separate issues for getting that loader to handle large datasets.
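
Roughly the shape I have in mind (a hypothetical sketch; only CSVPandasLoader is a real name here, and its actual interface in the repo may differ):

```python
# Hypothetical sketch of a loader abstraction; the real CSVPandasLoader in the
# repo may not look like this.
from abc import ABC, abstractmethod

import pandas as pd

class Loader(ABC):
    """Turns downloaded dataset files into a Python object."""

    @abstractmethod
    def load(self, path: str):
        ...

class CSVPandasLoader(Loader):
    """Eagerly reads a CSV into a pandas DataFrame (fine for small datasets)."""

    def load(self, path: str) -> pd.DataFrame:
        return pd.read_csv(path)

# A follow-up issue could add a large-dataset-aware loader with the same
# interface, e.g. one backed by Dask or SQLite, without touching this one.
```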

@bdwyer2 (Collaborator) commented Dec 17, 2020

> The way things stand currently, we would need a machine with 81 GB of memory to load the dataset when running the test.

It seems like the need for beefier CI/CD infrastructure keeps coming up. Let's make a note to discuss this when we get back from the holidays next year.

@xuhdev (Collaborator) commented Dec 17, 2020

@edwardleardi I don't see this issue as very urgent: most people who use this dataset would have a large amount of RAM in place; otherwise the dataset may not be very useful, depending on the use case (if even a subset can't be loaded). If we truly want some disk-backed option, we could add a CSVSqliteLoader, which loads the CSV file into a SQLite database and lets the user manipulate the database from there. This would introduce sqlite as an additional dependency, and it's perhaps best done as a separate package.
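
A rough sketch of that idea (the function name, table name, and chunk size are placeholders; nothing like this exists in pydax yet):

```python
# Sketch of the CSVSqliteLoader idea: stream a huge CSV into SQLite in chunks
# so the whole file never has to fit in memory. Names and sizes are placeholders.
import sqlite3

import pandas as pd

def load_csv_into_sqlite(csv_path: str, db_path: str, table: str = "ontime",
                         chunksize: int = 100_000) -> None:
    """Append a large CSV to a SQLite table, one chunk at a time."""
    with sqlite3.connect(db_path) as conn:
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table, conn, if_exists="append", index=False)

# Users could then query subsets without loading the full 81 GB, e.g.:
#   with sqlite3.connect(db_path) as conn:
#       df = pd.read_sql("SELECT * FROM ontime WHERE Year = 2019", conn)
```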

About the epic, sure: after the first release we would definitely split different kinds of tasks into different epics if that makes management easier. Currently we only distinguish pre-release and post-release, perhaps because we aren't focusing yet on what happens after the first release.

@xuhdev (Collaborator) commented Dec 17, 2020

Large-dataset support in general would be interesting, and should the demand arise we should do something to keep up with loading large datasets; it's just unclear to me yet what we should do about it. Perhaps we can open an issue and revisit it later?

@edwardleardi (Collaborator) commented

> Currently we only distinguish pre-release and post-release, perhaps because we aren't focusing yet on what happens after the first release.

Got it, good idea.

> Perhaps we can open an issue and revisit it later?

Opened here: https://app.zenhub.com/workspaces/pydax-5fcfdd73254483001e3f3b55/issues/codait/pydax/100
